Exploring Prosper Loan Data by Pyari Singh K

For the explore and summarise data project, I chose the prosperLoanData data set. The data set has 81 columns and 113,937 observations.

Univariate Plots Section


LoanOriginalAmount has a positive skew: the mean (8337) is greater than the median (6500).

MonthlyLoanPayment also has a positive skew: the median (217) is less than the mean (272).

CreditScoreRangeLower is approximately normally distributed, with a mean of 685.56 and a median of 680.

BorrowerRate is also roughly normally distributed, with a mean of 0.19 and a median of 0.18. However, there are several spikes, especially between 0.24 and 0.31.

Debt To Income Ratio exhibits a slight positive skew with mean > median. The mean is 0.275 and the median is 0.22. One notable observation: the ratio is below 0.5 for a large share of borrowers.

Stated Monthly Income has a positive skew with mean > median. Mean for the Stated Monthly Income is 5608 and the median is 4666.
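The skew calls above all rest on comparing the mean with the median. A minimal sketch of that rule of thumb follows; the helper name `skew_direction` and the sample values are hypothetical and not part of the original analysis, and the comparison is only a heuristic, not a formal skewness statistic.

```r
# Hypothetical helper: classify skew direction by comparing mean and median.
skew_direction <- function(x) {
  x <- x[!is.na(x)]            # drop missing values before summarising
  m <- mean(x)
  med <- median(x)
  if (m > med) "positive" else if (m < med) "negative" else "none"
}

# Illustrative values only (not taken from the Prosper data).
loan_amounts <- c(1000, 2500, 4000, 6500, 9000, 15000, 35000)
skew_direction(loan_amounts)
```

A histogram should still be checked alongside this, since the mean and median alone cannot reveal spikes or multiple modes.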

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0    637.5   1061.5   1224.9   1622.4 171004.2     8554

MonthlyDebtTotal has been added as a new column. It was derived by multiplying the Debt To Income Ratio by the Stated Monthly Income.

The histogram of MonthlyDebtTotal has a positive skew, and the mean (1224.9) is greater than the median (1061.5). The first quartile is 637.5 and the third quartile is 1622.4.
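The derivation of the new column can be sketched as below. The data frame is a toy stand-in, and the column names `DebtToIncomeRatio` and `StatedMonthlyIncome` are assumed to match the data set; the values are illustrative only.

```r
# Toy rows standing in for the Prosper data; column names are assumed.
loans <- data.frame(
  DebtToIncomeRatio   = c(0.22, 0.50, NA),
  StatedMonthlyIncome = c(4666, 3000, 5000)
)

# Monthly debt = debt-to-income ratio * stated monthly income.
# A missing ratio (or income) yields a missing monthly debt.
loans$MonthlyDebtTotal <- loans$DebtToIncomeRatio * loans$StatedMonthlyIncome

summary(loans$MonthlyDebtTotal)
```

Because `NA` propagates through the multiplication, rows with a missing ratio or income show up under `NA's` in the summary of the derived column.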

The Employment Status Duration exhibits a positive skew: the mean (96) is higher than the median (67). Most borrowers have a relatively short employment status duration.

Current Credit Lines has a slight positive skew, with a mean of 10.31 and a median of 10.

Loans originate mostly during October, December and January.

Loan originations dropped during 2009. This may be attributable to the economic crisis around that time.

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

Most of the loans are in either the Current or the Completed status.
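To put a number on this, the counts printed above can be turned into shares. This is a sketch that re-enters the counts from the table by hand; the variable names are hypothetical.

```r
# Loan status counts copied from the table above.
status_counts <- c(
  Cancelled = 5, Chargedoff = 11992, Completed = 38074, Current = 56576,
  Defaulted = 5018, FinalPaymentInProgress = 205,
  "Past Due (>120 days)" = 16, "Past Due (1-15 days)" = 806,
  "Past Due (16-30 days)" = 265, "Past Due (31-60 days)" = 363,
  "Past Due (61-90 days)" = 313, "Past Due (91-120 days)" = 304
)

# Share of each status, as a fraction of all loans.
status_share <- status_counts / sum(status_counts)

# Percentages, largest first.
round(100 * sort(status_share, decreasing = TRUE), 1)

# Combined share of Current and Completed loans.
sum(status_share[c("Completed", "Current")])
```

With these counts, Current and Completed loans together make up about 83% of the 113,937 observations.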

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

Most of the loans are taken by people in the $25,000-49,999 and $50,000-74,999 income ranges.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00
## 
##    12    36    60 
##  1614 87778 24545

Most of the loans (87,778 of 113,937, about 77%) have a term of 36 months (3 years).

The top 10 states by number of borrowers were plotted, and California topped the list.
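One way to obtain such a ranking is to tabulate the state column and keep the ten largest counts. The sketch below uses a small made-up vector in place of the real state column, so the counts are illustrative only.

```r
# Made-up state codes standing in for the borrower state column.
borrower_state <- c("CA", "TX", "CA", "NY", "FL", "CA", "TX", NA, "NY", "CA")

# table() drops NA by default; sort the counts from largest to smallest.
state_counts <- sort(table(borrower_state), decreasing = TRUE)

# Keep at most the ten most frequent states.
top_states <- head(state_counts, 10)
top_states
```

The resulting table can be passed to `barplot(top_states)` for a quick base R bar chart.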

Since most of the loans belong to the “Other” occupation category, the data is not sufficient to determine which occupations take out the highest loans.

## [1] "Borrowers - Are they home owners?"
## False  True 
## 56459 57478

Just over half of the borrowers (57,478 of 113,937, about 50.4%) are home owners.

The Employment Status data doesn't seem to be complete. Also, as anticipated, the Employed category accounts for the highest number of loans.


Univariate Analysis


What is the structure of your dataset?

What is/are the main feature(s) of interest in your dataset?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Bivariate Plots Section


Bivariate Analysis

## Warning: Removed 16702 rows containing missing values (geom_point).

Figure: Current Credit Lines Vs Total Monthly Debt

There is a moderate positive correlation between Current Credit Lines and the Total Monthly Debt; the correlation coefficient is 0.47.
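Since rows with missing values are dropped from the plots, the correlation has to be computed on complete pairs only. A minimal sketch with made-up vectors (the names and values are hypothetical, not taken from the data set):

```r
# Made-up values with some missing entries, for illustration only.
credit_lines <- c(4, 8, 10, NA, 15, 20, 12)
monthly_debt <- c(300, 700, NA, 900, 1500, 1800, 1100)

# With the default settings, any missing value makes the result NA.
cor(credit_lines, monthly_debt)

# Restrict the calculation to pairs where both values are present.
r <- cor(credit_lines, monthly_debt, use = "complete.obs")

# Number of pairs that actually entered the calculation.
n_used <- sum(complete.cases(credit_lines, monthly_debt))
```

Reporting `n_used` next to the coefficient makes it clear how many observations the figure is based on.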

## Warning: Removed 766 rows containing missing values (geom_point).

There is a moderate negative correlation between the credit scores and the borrower rate. The correlation coefficient is -0.46. Borrowers with higher credit scores get loans at a lower borrower rate.

## Warning: Removed 11195 rows containing missing values (geom_point).

There is a positive correlation between the stated Monthly Income and the total monthly debt. The correlation coefficient is 0.36.

The correlation between the loan amount and the borrower rate is negative. The value of the correlation coefficient is -0.32. This suggests that larger loans tend to be disbursed at a lower interest rate.

## Warning: Removed 591 rows containing missing values (geom_point).

There is a slight positive correlation between loan amount and credit scores: larger loans tend to go to borrowers with higher credit scores.

## Warning: Removed 11189 rows containing missing values (geom_point).

Monthly Income Vs Total Monthly Debt was plotted and facet wrapped by IsBorrowerHomeowner. The dispersion is higher for home owners; for non home owners, the points are concentrated more among borrowers with an income of around 5000.

The median loan amount dips in 2009 and rises sharply in 2013.

Loan amounts are usually higher from December to February.
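Yearly and monthly medians like these can be computed by extracting the year and month from the origination date and aggregating. The sketch below uses a few made-up rows; the column names `LoanOriginationDate` and `LoanOriginalAmount` are assumed to match the data set, and `Year` and `Month` are helper columns introduced here.

```r
# Made-up rows for illustration; not taken from the Prosper data.
loans <- data.frame(
  LoanOriginationDate = c("2008-03-14", "2009-07-02", "2009-11-20",
                          "2013-01-15", "2013-12-03", "2013-12-28"),
  LoanOriginalAmount  = c(5000, 3000, 4000, 10000, 15000, 12000)
)

# Extract numeric year and month from the origination date.
dates <- as.Date(loans$LoanOriginationDate)
loans$Year  <- as.integer(format(dates, "%Y"))
loans$Month <- as.integer(format(dates, "%m"))

# Median loan amount per year and per calendar month.
median_by_year  <- aggregate(LoanOriginalAmount ~ Year,  data = loans, FUN = median)
median_by_month <- aggregate(LoanOriginalAmount ~ Month, data = loans, FUN = median)
```

Numeric months are used rather than month names so that the result does not depend on the locale.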


Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section


Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary


Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection

